ABSTRACT

BoW is an on-line bibliographical repository based on a hierarchi- 
cal concept index to which entries are linked. Searching in the 
repository should therefore return matching topics from the hierar- 
chy, rather than just a list of entries. Likewise, when new entries are 
inserted, a search for relevant topics to which they should be linked 
is required. We develop a vector-based algorithm that creates key- 
word vectors for the set of competing topics at each node in the 
hierarchy, and show how its performance improves when
domain-specific features are added (such as special handling of topic titles
and author names). The results of a 7-fold cross validation on a 
corpus of some 3,500 entries with a 5-level index are hit ratios in 
the range of 89-95%, and most of the misclassifications are indeed 
ambiguous to begin with. 

1. INTRODUCTION 

An obvious and natural approach to organize a large corpus of 
data is a hierarchical index — akin to a book’s table of contents. 
The type of corpus we deal with is a bibliographical repository, 
with entries from a limited domain (our prototype is on “parallel 
systems”). Given such an index, it is desirable that search results 
point to relevant locations in the hierarchy, rather than just provid- 
ing a flat list of entries. This is useful not only to support user 
searching, but also as an aid suggesting possible places to link new 
entries that are inserted into the repository. 

1.1 BoW - Bibliography on the Web 

The goal of the BoW project [9] is to create a convenient environ- 
ment for using and maintaining an on-line bibliographic repository. 
The key idea is that this be a communal effort shared by all the 
users. Thus every user can benefit from the input and experience 
of other users, and can also make contributions. In fact, the system 
tabulates user activity, so merely searching through the repository 
and exporting selected items already contributes to the ranking of 
items in terms of user interest. A prototype implementation is avail- 
able at http://www.bow.cs.huji.ac.il. 

The heart of the BoW repository is a deep (multi-level) hierarchical
index spanning the whole domain. The nodes in the hierarchy are called
concept pages. Pages near the top of the hierarchy represent broad
concepts, while those near the bottom represent more
narrow concepts. The depth of the hierarchy should be sufficient so 
that the bottommost pages only contain a handful of tightly related 
entries (as opposed to Web search engines and scientific literature 
databases like CORA [5], which contain a relatively shallow directory).
A subtree containing all the concept pages reachable from
a certain (high-level) concept page is referred to as a topic. Entries
can be linked to multiple concept pages, if they pertain to multi- 
ple concepts. Likewise, they can be linked at different levels of the 
hierarchy, depending on their breadth and generality. 

The index is navigated using a conventional browser. Normally 
three frames are available (Fig. 1). The first shows the hierarchical 
index, and the currently selected concept page. The second lists 
entries linked to this concept page, and allows for the selection of 
a specific entry. The third displays the surrogate of the chosen
entry, including all the bibliographical data (authors, title, where and
when published), user annotations, and additional links (e.g. to
where the full text is available). Available operations on the current
entry include marking it for export, adding an annotation, and
adding links. These include links from additional concept pages to
the entry, links between this entry and related entries (e.g. from a 
preliminary version of a paper to the final version), and links to 
external resources such as the full text. 

The index structure is created by the site editor. The vocabulary 
used in the index and annotations is uncontrolled by the system, and 
users also query the system using natural language [2]. Indexing is 
simplified by the fact that we use concise surrogates, rather than 
full text documents [13]. We make up for the reduction in data by 
enlisting users to verify indexing suggestions. Thus, when a user 
introduces a new entry, the system uses the text of the entry as a 
query, and finds concept pages that contain similar entries. But the 
actual decision to link the new entry to these concept pages is left 
to the discretion of the user. 

The indexing described in this paper is based on lexical analy- 
sis of concept pages and entries linked to them. For each topic, 
we create a list of keywords that differentiate it from other topics 
that have the same parent. The indexing then proceeds from the 
root, choosing the most suitable sub-topic(s) at each point. As only 
contending topics are considered, the complexity of the search is 
reduced [14, 20]. 

1.2 Related Work

There are three basic approaches to processing textual documents
[15]: lexical, syntactic, and semantic analysis. A number of
systems using syntactic and semantic analysis have been developed
and are being used for research, such as DR-LINK [18], CLARIT
[8] and TREC [7, 31]. However, they are typically not significantly
better than the best lexical analyzers. We will discuss various lexical
analyzers throughout the paper, in relation to our work.

Figure 1: Screen dump of BoW showing partially opened hierarchical index.

Very little has been done so far on hierarchical indexing. In general,
it has been shown that hierarchical indexing methods outperform
traditional flat algorithms [20, 14]. However, these studies
were based on a very wide domain and a relatively shallow hierarchy
(e.g. two levels). Our work, in contrast, requires a very fine
classification, as the bottom levels of the hierarchy only contain a
small number of entries each.

Search and browsing based on a hierarchy was suggested in [24]. 
However, in this case the hierarchy is very strict and depends on
nested key phrases (e.g. “forest fires” is under “forest”), which allows
it to be automated. We take the opposite approach: the hierarchy
is created by humans so as to capture pertinent concepts,
and the automation comes in trying to find what characterizes this 
structure. 

2. OFF-LINE PREPARATION OF KEYWORD VECTORS

The hierarchical indexing mechanism consists of two parts. The
first is an off-line traversal of the whole repository, repeated at regular
intervals (e.g. once a day) in order to compute keyword vectors
for all the topics. The second is a matching scheme that compares
new entries or queries with these pre-computed keyword vectors.

Topic             Number of clusters    Hit ratio (5-grams)    Hit ratio (whole words)
Cooking recipes   10                    87%                    53%
Linux             16                    85%                    47%

Table 1: Clustering hit ratios for two given document collections
using 5-grams versus whole words.

The off-line part is executed recursively for every level of the
index, top-down. The main idea is that each topic encompasses
all the concept pages in a sub-tree of the index, so all of
them should be taken into account when constructing its keyword
vector. The group of sibling topics, located at the same level and
having the same parent in the index, is called a competitive topic
set, since they compete with each other for keywords. The algorithm
generates keyword vectors in five steps: parse all the pages
in the topic’s sub-tree, merge them into one vector, unify the resulting
vectors to include the same words, normalize the weights of
the words in all the vectors, and choose the words that are relatively
most frequent to represent the corresponding topics.

2.1 Parsing 

The first stage is parsing the text of concept pages, with the goal
of creating a vector of all the words in the given concept page [30,
33], denoted by $\mathit{Voc}_{\mathit{page}}$. This of course requires us to define
“word”.

The natural definition is a completely separated meaningful string. 
This has the well-known disadvantages of treating related words as 
being different, and the well-known solutions such as stemming 
(e.g. [23, 19]). An alternative is to use n-grams (substrings of 
length n of words: for example, “algorithm” will be turned into 
“algor”, “lgori”, “gorit”, “orith”, and “rithm”) [1]. We prefer the
latter, and specifically use 5-grams, based on a separate study¹ in
which documents were clustered automatically based on similarity 
and this was compared with manual clustering (Table 1). But in 
order to avoid 5-grams that are largely based on common suffixes 
and therefore meaningless, we also use stemming first. 
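
The parsing step can be illustrated with a minimal sketch. It assumes a simple tokenizer and uses a hypothetical `naive_stem` helper as a stand-in for a real stemmer (the paper does not say which stemmer is used); keeping words shorter than five letters whole is also an assumption.

```python
import re


def char_ngrams(word, n=5):
    """Return all substrings of length n of `word`, in order."""
    if len(word) <= n:
        return [word]  # assumption: short words are kept whole
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def naive_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def parse_text(text, n=5):
    """Lower-case the text, split it into words, stem them, and emit n-grams."""
    grams = []
    for token in re.findall(r"[a-z]+", text.lower()):
        grams.extend(char_ngrams(naive_stem(token), n))
    return grams
```

Here `char_ngrams("algorithm")` reproduces the five 5-grams listed above, and stemming first means that a suffix such as “ing” never contributes 5-grams of its own.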

Note that longer words are represented by more 5-grams in the 
vocabulary vector than shorter ones, which gives them more weight 
in the comparisons. Thus it would be interesting to check if similar 
results would be obtained by using whole words, and weighting 
them according to length. 

In any case, from now on the word “word” will mean a 5-gram. 

2.2 Merging 

After parsing all the concept pages in a topic’s sub-tree, the re- 
sulting vocabulary vectors are merged. The resulting vector in- 
cludes the complete vocabulary of the topic: 

$$\mathit{Voc}_{\mathit{topic}} = \mathit{Voc}_{\mathit{page}_1} \cup \mathit{Voc}_{\mathit{page}_2} \cup \ldots \cup \mathit{Voc}_{\mathit{page}_n}.$$

The counters indicating how many times each word appears are 
summed as described below. 

2.3 Unification 

In order to compare a query with a set of competitive topics, the
vocabulary vectors of these topics must span the same space. We
therefore create a unified vocabulary that includes all the words that
appear in any of the competitive topics:

$$\mathit{Voc}_{\mathit{compet\text{-}set}} = \mathit{Voc}_{\mathit{topic}_1} \cup \mathit{Voc}_{\mathit{topic}_2} \cup \ldots \cup \mathit{Voc}_{\mathit{topic}_k}.$$

We then normalize the vocabulary vectors of the individual topics
to include all these words, by adding the missing ones with a
count of zero. The resulting normalized vectors will be denoted by
$\mathit{NormVoc}_{\mathit{topic}}$.

¹ In cooperation with E. Boncheck.
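
As a rough illustration of the merging and unification steps, the sketch below represents each vocabulary vector as a `collections.Counter`. The function names and the dictionary layout are assumptions made for the example, not part of the BoW implementation.

```python
from collections import Counter


def merge_pages(page_vocs):
    """Merge per-page vocabulary counters into one topic vocabulary.

    The key set is the union of the page vocabularies; counts are summed.
    """
    topic_voc = Counter()
    for voc in page_vocs:
        topic_voc.update(voc)
    return topic_voc


def unify(topic_vocs):
    """Give every topic in a competitive set the same vocabulary.

    `topic_vocs` maps topic name -> Counter. Words missing from a topic
    are added with a count of zero.
    """
    unified = set()
    for voc in topic_vocs.values():
        unified.update(voc)
    return {
        topic: {word: voc.get(word, 0) for word in sorted(unified)}
        for topic, voc in topic_vocs.items()
    }
```

After `unify`, all vectors in the competitive set have identical keys, so they can be compared word by word.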

2.4 Counters Normalization 

In order to select meaningful keywords, we need to consider the 
number of times each word appears in each topic. As shown in Fig. 
2, these values vary considerably. But they also suffer from the 
scaling effect problem [15]: the counter values in “small” topics 
are generally lower than in “big” topics, leading to an assignment 
of all the keywords to the bigger topics. To compensate for this, we 
need to normalize the counters based on the size of each topic. 

The simplest approach is to divide the counter values by the total
number of words in the topic. However, according to Zipf’s
formula [34], rank × count ≈ constant (where the words in
the text are ranked in order of decreasing count), so the number
of distinct words in the text grows much slower than their counts.
Practically, about 50% of the regular text content consists of the
same 250 words [15]. Therefore this method does not lead to good
normalization (Fig. 3).

The most popular algorithm is TFIDF (Term Frequency Inverse 
Document Frequency) [27, 26, 6]. However, this technique does 
not take into account the frequency of term occurrences in other 
documents in the collection, based on the assumption that there are 
very many documents. In our case, we are trying to distinguish 
between a small set of topics, so an adjustment is needed. When 
applied directly, TFIDF did not produce good results (Fig. 3). 

Our chosen approach is to normalize the counters on-the-fly dur- 
ing the previous three steps. Since we are interested in defining a 
topic’s vocabulary, words which occur frequently in one particular 
entry within it should not have a higher weight. Thus, we count 
each word only once for every entry containing it in the concept 
page. For example, given a topic with 5 entries, the maximal weight 
of a word is 5 if it appears in all the entries, but if it appears twice in 
one entry and three times in another, its weight will only be 2. This 
normalization is implemented as part of the parsing algorithm. To 
deal with the fact that concept pages have different sizes, the counters
are further normalized by dividing by the number of entries in
a page or topic. This is done as part of the merging and unification. 

The comparative results of this method are illustrated in Fig. 3.
As the graph shows, the maximal weights are distributed uniformly,
irrespective of the topic size.
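
The chosen normalization can be sketched as follows, assuming each entry has already been parsed into a list of words. The helper names are hypothetical.

```python
def entry_presence_counts(entries):
    """Count, for each word, the number of entries that contain it.

    `entries` is a list of word lists, one per entry. Repeated occurrences
    of a word inside a single entry are counted only once.
    """
    counts = {}
    for words in entries:
        for word in set(words):
            counts[word] = counts.get(word, 0) + 1
    return counts


def normalize_by_size(counts, num_entries):
    """Divide each counter by the number of entries in the page or topic."""
    if num_entries == 0:
        return {word: 0.0 for word in counts}
    return {word: c / num_entries for word, c in counts.items()}
```

For the example in the text (five entries, with a word appearing twice in one entry and three times in another), `entry_presence_counts` gives that word a weight of 2, and dividing by the five entries yields 0.4.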



2.5 Keywords Selection Heuristic

A keyword is a word that characterizes a concept and differentiates
one topic from others [15]. Thus, in order to decide whether a
word is a keyword of some topic, one should consider its frequency
(weight) in this topic, and also compare it with its weights in all the
competitive topics. The basic idea is that if a word is extremely
frequent in one particular topic and relatively rare in others, then we
may use it as a keyword for this topic. If a word has similar weight
in all the topics, then it does not represent any of them, even if its
weight is high [29].

One way to assess the discriminatory power of a word is based on
the difference between its maximal and minimal counter values in
different topics in the competitive set. More formally, the algorithm
is as follows (where $\mathit{NormVoc}_t(w)$ denotes the counter value for
word $w$ in the normalized vector of topic $t$):

1. For each topic in the competitive set, find those words that
achieve their maximal counter value in this topic:

$$\mathit{Max}_t = \{\, w \mid \forall i,\ i \neq t : \mathit{NormVoc}_t(w) > \mathit{NormVoc}_i(w) \,\}.$$

2. For these words, find the range of counter values:

$$\forall w \in \mathit{Max}_t, \quad \mathit{Dif}(w) = \max_{i \neq t} \{ \mathit{NormVoc}_t(w) - \mathit{NormVoc}_i(w) \}.$$

3. Sort the words in $\mathit{Max}_t$ according to $\mathit{Dif}(w)$ in descending
order.

4. Choose the top 10% of the words (those with the biggest
difference values) and place them in the keywords vector
$\mathit{Tkeys}_t$.

A possible problem with this definition is that the difference can
be large because the minimal value is very small. An alternative is
therefore to use the difference between the two top counter values
in step 2. The definition then becomes

$$\mathit{Dif}(w) = \mathit{NormVoc}_t(w) - \max_{i \neq t} \{ \mathit{NormVoc}_i(w) \}.$$

This version selects the words with significantly greater weight in
one particular topic than in all the others, but may miss cases in
which a word has a high count in 2 or 3 topics (which may happen
as shown in Fig. 4). Specifically, in the BoW corpus the gap between
the two highest values is the largest in 65-79% of the cases,
but the gap between the 2nd and 3rd is the largest in another
15-22%.

Another disadvantage of this heuristic is the percentage of words
to be chosen as the most significant: we decided to choose an
empirically-determined 10% threshold, but for other repositories
another threshold may be more reasonable. An alternative
is to choose the most significant words according to their statistics.
Specifically, we propose to select those words whose counter value
is larger than the average plus one standard deviation:

1. For each word calculate the average counter value:

$$\mathit{average}(w) = \frac{1}{n} \sum_{1 \le i \le n} \mathit{NormVoc}_i(w)$$

(where $n$ is the number of topics in the competitive set).

2. Calculate the standard deviation:

$$\mathit{std\text{-}dev}(w) = \sqrt{\frac{1}{n} \sum_{1 \le i \le n} \left( \mathit{NormVoc}_i(w) - \mathit{average}(w) \right)^2}.$$

3. If $\mathit{max\text{-}weight}(w) > \mathit{average}(w) + \mathit{std\text{-}dev}(w)$ then
the word $w$ is a keyword of the maximal weight topic, otherwise
it does not represent any topic since it is almost equally
frequent in all of them.

To check if the word should be a keyword for other topics as well,
the highest value is removed and the procedure repeated for the
remaining topics.

To compare the above heuristics we used them to classify 200
entries from the BoW prototype repository. The results are shown
in Table 2, and indicate that the last heuristic (using the average and
standard deviation) is the best.

Heuristic used                      Hit ratio
extreme values differences          48%
two highest values differences      62%
max_weight > avg + std_dev          87%

Table 2: Correct classification rate when using alternative
heuristics for keyword selection.


2.6 Optimizations 

2.6.1 Stop-lists 

A well-known optimization in classifications based on lexical 
analysis is the definition of a stop-list — a list of common words 
that should be ignored. In order to generate the list automatically, a 
threshold distinguishing the most common words should be found. 
Numerous studies of documents show that 30% of general English 
text encompassing millions of words is made up of only 18 distinct 
words [15]. Usually, stop-lists contain about 250-300 terms [32, 
25, 10]. However, our repository is limited to a focused scientific 
domain, so its language is rather limited, and may vary among topics.
Thus, the stop-list should contain only those words which are
common in all the topics in the competitive set. This leads us to the
following method for stop words identification:

$$\text{stop-list} = \{\, w \mid \forall i,\ \mathit{NormVoc}_i(w) > r \,\},$$

where $r$ is an empirically selected threshold on the counter values.

Figure 4: Distribution of word counts in the top level 7 topics for 30 selected 5-grams. The counters are sorted in a descending order.

We observed that the counter values of common words are higher at
the top level of the hierarchy than at lower levels. The best r value
is 80 for the highest level of the hierarchy, 60 for the second one,
and 50 for the rest, where the average maximal counters are 320, 240
and 200, respectively. Thus the empirically obtained rule is that the
stop-word threshold is a quarter of the average maximal counter value
for the given competitive set of topics.
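
A minimal sketch of this stop-list rule is given below. It assumes that the “average maximal counter” is the mean, over the topics of the competitive set, of each topic's largest counter value; the function name and data layout are hypothetical.

```python
def build_stop_list(norm_vocs, r=None):
    """Return the words whose counter exceeds r in every topic of the set.

    `norm_vocs` maps topic -> {word: counter}. If `r` is None, the
    quarter-of-average-maximum rule of thumb is applied.
    """
    topics = list(norm_vocs)
    if not topics:
        return set()
    if r is None:
        # Assumption: average, over topics, of each topic's largest counter.
        maxima = [max(voc.values(), default=0) for voc in norm_vocs.values()]
        r = (sum(maxima) / len(maxima)) / 4
    vocab = set(norm_vocs[topics[0]])
    return {w for w in vocab if all(norm_vocs[t].get(w, 0) > r for t in topics)}
```

With two topics whose maximal counters are 300 and 340, the average is 320 and the derived threshold is 80, matching the top-level figures quoted above.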

Another interesting question is whether the stop-lists at various 
locations in the hierarchy will differ. It is reasonable to expect that 
words like “network”, “software”, and “language” will be impor- 
tant at the highest level of the hierarchy, since each of them leads 
to an appropriate broad topic, such as “architecture and intercon- 
nections”, “operating systems and run-time support”, and “pro- 
gramming, languages, and compilation” (see Fig. 1). Obviously, 
inside the topic “programming, languages, and compilation” the 
words “programming” and “languages” should be the first ones to 
go to the stop-list. However, our observation of the parallel systems
repository has shown that most of the stop-words at all the
levels were the same, while for every lower level several additional
common stop-words were added. The total number of stop-words 
is around 200 with slight differences for various competitive sets. 

2.6.2 Special Treatment for Selected Fields

Another means for optimization is using domain-specific knowl- 
edge. In our case the domain is a bibliographical repository, which 
is classified into topics. Thus special fields like author names and
topic titles may carry special significance.

For example, the title fields of topics and sub-topics may be expected
to reflect the contents of the topic, since they are based on a
semantic understanding by a human editor. It is therefore desirable
to use these words as keywords, even if the counter-based algorithm 
described above does not recognize them as such. 

The special treatment of author names is founded on the assump- 
tion that usually scientists tend to concentrate their work in a rather 
narrow area of research. Therefore if several of the given author’s 
publications appear in one specific topic of the competitive set, but 
not in the others, then it is sensible to suggest that the new article 
will also belong to this topic. As most of the author names appear 
too rarely and thus do not survive the keyword filtering process, 
special treatment is required. Just as in the case of topic titles, we 
simply treat author names explicitly as keywords. For this purpose, 
the first and last names are concatenated and treated as a single
term. 
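
The special handling of titles and author names might look like the following sketch. Lower-casing the concatenated name and the helper names themselves are assumptions made for illustration.

```python
def author_term(first, last):
    """Concatenate first and last name into a single term.

    Assumption: the term is lower-cased and internal spaces are dropped.
    """
    return (first + last).replace(" ", "").lower()


def add_special_keywords(keywords, title_words, authors):
    """Return the keyword set extended with title words and author terms.

    `authors` is an iterable of (first name, last name) pairs. These terms
    are added even if the counter-based selection did not pick them.
    """
    extended = set(keywords)
    extended.update(title_words)
    extended.update(author_term(first, last) for first, last in authors)
    return extended
```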

2.6.3 Thesauri 

The final major problem to be considered here is the use of sim- 
ilar or related terms (synonyms). Thus the use of thesauri in order 
to recognize variants or to control the vocabulary has been sug- 
gested [3]. A specific feature of our index is that it contains a lot 
of names of projects, systems, and tools, which are often referred 
to by acronyms. Text observations show that typically such terms 
occur in one of the following formats at least once [16]: 

1. The full term, with capitalized words, followed by the acronym
formed from the same initial capital letters, in parentheses.

2. The acronym, followed by the full term in parentheses.

Based on this we developed a thesaurus-builder which is responsi- 
ble for lexical text analysis and extracting the full expressions and 
their acronyms, and used it to construct a dictionary of acronyms. 
This was used during parsing to check if the acronym or its interpre- 
tation occur in any particular concept vocabulary, and if so it was 
explicitly entered into the keywords vector. User queries are also 
checked against the thesaurus, and expanded in a similar manner. 
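
The two acronym formats can be recognized with regular expressions, as in the sketch below. The patterns are assumptions for illustration and are deliberately simple: they only handle terms in which every word is capitalized, so a name such as “Bibliography on the Web (BoW)” would not be matched.

```python
import re

# Format 1: "Parallel Virtual Machine (PVM)"
FULL_THEN_ACRONYM = re.compile(r"((?:[A-Z][a-z]+ )+)\(([A-Z]{2,})\)")
# Format 2: "MPI (Message Passing Interface)"
ACRONYM_THEN_FULL = re.compile(r"\b([A-Z]{2,}) \(((?:[A-Z][a-z]+ ?)+)\)")


def _initials(words):
    return "".join(word[0] for word in words)


def extract_acronyms(text):
    """Return a dict mapping each acronym found in `text` to its full term."""
    found = {}
    for m in FULL_THEN_ACRONYM.finditer(text):
        acronym = m.group(2)
        # Keep only as many trailing words as the acronym has letters.
        words = m.group(1).split()[-len(acronym):]
        if _initials(words) == acronym:
            found[acronym] = " ".join(words)
    for m in ACRONYM_THEN_FULL.finditer(text):
        acronym, words = m.group(1), m.group(2).split()
        if _initials(words) == acronym:
            found[acronym] = " ".join(words)
    return found


def expand_terms(terms, thesaurus):
    """Add the counterpart (full term or acronym) of every known term."""
    reverse = {full.lower(): acr for acr, full in thesaurus.items()}
    expanded = list(terms)
    for term in terms:
        if term in thesaurus:
            expanded.append(thesaurus[term])
        elif term.lower() in reverse:
            expanded.append(reverse[term.lower()])
    return expanded
```

The same `expand_terms` helper could be applied both to concept vocabularies during parsing and to user queries.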

3. ON-LINE SEARCHING 

Given the keyword vectors for all the repository’s topics, those 
matching queries can be found. This is done in two cases: when 
a user issues a search by specifying authors and/or keywords, and 
when a user inserts a new entry into the repository. In this latter 
case, the goal is to recommend topics to which the new entry may 
be linked. 

An important goal is that a retrieved set will be of “reasonable” 
size — large enough to give the user a choice but not too large. 
BoW therefore doesn’t retrieve a set of individual documents in 
response to a query. Instead, it returns whole concept pages. More- 
over, if many of these concept pages belong to the same higher- 
level topic, that topic is returned rather than listing the lower level 
ones. 

3.1 Matching and Ranking 

Matching and ranking go together — we want to find the topics 
that match the query to the highest degree. Several methods for 
such ranking exist [21]. The most popular are based on the TFIDF 
algorithm described in section 2.4 [26, 30, 28, 27] and will be re- 
jected here for the same reasons. An alternative approach which is 
usually used in clustering (e.g. in Isodata Clustering) is to compute 
the distances between the keyword vectors. This can be applied in 
our case, by comparing the distances between the query vector and 
the competitive set vectors. However, the query is typically so short
that it is not reasonable to weight its terms [11], so a distance based
on relative term frequencies between the query and the index vectors
is not useful in our model. Thus we have to use a boolean ranking
method [17], rather than a vector space algorithm.

Our matching process works as follows: 

1. Check the query data against the acronyms thesaurus, and 
insert both acronyms and their full interpretation into the ini- 
tially empty query vocabulary vector QVoc. 

2. Parse the query (or new entry) and insert the resulting 5-grams
into the vocabulary vector QVoc (without any term weighting).

3. Starting from the highest level topics, measure the similarity
of the query to all topics in the competitive set by counting
the number of common words in the vectors:

$$\mathit{score}_{\mathit{topic}} = |\mathit{QVoc} \cap \mathit{Tkeys}_{\mathit{topic}}|.$$

4. Select the topics with the highest score, and continue recur- 
sively to lower levels. The selection criterion is that the score 
be higher than the average plus a standard deviation, as was 
done in section 2.5. This gives good results because in 84%- 
91% of the queries the biggest gap is between the highest 
and the next topic, or between the second highest and the 
third one (Fig. 5). 

Note that we don’t examine all the tree branches, but only those 
which survive the filtering criteria, thus reducing the computational 
cost. This technique, called tree pruning, was also employed by 
others [14, 20], except that they choose only the single most suitable
sub-topic at each level. The main disadvantage of such aggressive
“single-path” pruning is that a failure at one of the higher levels
will cause the whole classification process to fail, whereas pruning that
keeps two or three branches for further examination attains almost
the same accuracy as full tree evaluation. Therefore, our ranking
scheme does not suffer from the problem of irrecoverable errors.
Choosing more than one also meets our expectation that
an article may refer to several categories in the bibliography.
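
The matching steps and the multi-path pruning described in this section can be sketched as a short recursive routine. The `Topic` class, the fallback used when no topic clears the threshold, and the decision to return a topic itself when none of its children match are all assumptions of this sketch, not details given in the paper.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Topic:
    name: str
    keywords: set
    children: list = field(default_factory=list)


def select_topics(scores):
    """Return the names scoring above average plus one standard deviation.

    Assumption: if no topic clears the threshold (e.g. in very small sets),
    the top-scoring topics are kept, provided they matched some keyword.
    """
    if not scores:
        return []
    values = list(scores.values())
    threshold = mean(values) + pstdev(values)
    chosen = [name for name, s in scores.items() if s > threshold]
    if not chosen:
        best = max(values)
        chosen = [name for name, s in scores.items() if s == best and best > 0]
    return chosen


def match(query_voc, topics, path=()):
    """Descend the index recursively and return the paths of matched topics."""
    scores = {t.name: len(query_voc & t.keywords) for t in topics}
    by_name = {t.name: t for t in topics}
    results = []
    for name in select_topics(scores):
        here = path + (name,)
        deeper = match(query_voc, by_name[name].children, here)
        # Assumption: if no child matches, the topic itself is returned.
        results.extend(deeper or [here])
    return results
```

Because `select_topics` can return several names, more than one branch may be followed at each level, which is what distinguishes this scheme from single-path pruning.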

3.2 Output Representation 

Observe that the total number of selected topics may grow expo- 
nentially while descending the tree, if most subtopics are selected 
at each stage. To avoid showing the user such a long list of hits, we
replace them all by their shared parent. As a result, the more general
(higher level) topic is returned to the user. The condition
for such output compression is that at least 50% of the particular
topic’s children, and more than two of them, are in the resulting
list. The compression routine is performed recursively from the
bottom to the root of the index. The results of output compression
are demonstrated in Fig. 6. The output was compressed for about
25% of the queries. The majority of the compressed output
sets were those including 14 links or more; only 10% of these
remained untouched. On the other hand, only 10% of the smaller sets
(up to 13 links) were compressed. The compression ratio is quite
large: the size of compressed output sets decreased by half
on average.
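
The compression condition can be expressed as a bottom-up traversal, as in the following sketch. The tree representation (a dictionary from each topic to its list of children) and the choice never to collapse the hits into the root itself are assumptions made for the example.

```python
def compress(tree, root, selected):
    """Replace groups of selected siblings by their parent, bottom-up.

    `tree` maps a topic to the list of its children (leaves may be omitted).
    `selected` is the set of topics chosen by the ranking process.
    A parent replaces its selected children when more than two of them,
    and at least 50% of all its children, are selected.
    """
    result = set(selected)

    def visit(node):
        children = tree.get(node, [])
        for child in children:
            visit(child)  # handle descendants first (bottom-up)
        hits = [c for c in children if c in result]
        if len(hits) > 2 and 2 * len(hits) >= len(children):
            result.difference_update(hits)
            result.add(node)

    # Assumption: the root itself is never returned as a hit.
    for child in tree.get(root, []):
        visit(child)
    return result
```

Since children are processed before their parent, a parent that was itself produced by compression can in turn be merged into its own parent.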

Given the topics selected by the ranking process, and remain- 
ing after output compression, the question is how to display them 
on the screen. The dilemma is how to reconcile two contradicting 
considerations: keep both the concept pages’ topological locations 
in the hierarchy (as in the Berkeley Cha-Cha Search Engine [4]), 
and their respective ranking with regard to this query (as is typically
done in search engines, e.g. Northernlight [22]). Our solution
is to display the original index tree, with the selected links opened 
and marked with different colors and font sizes according to their 
relevance to the query. 

4. EVALUATION 

In order to check the final algorithm performance we have con- 
ducted a sequence of 5 experiments employing 7-fold cross valida- 
tion over a corpus of about 3,500 bibliographic entries. The corpus 
is focused on the domain of parallel systems, with an index that has 
an average depth of 5 and an average branching factor of 6. Ev- 
ery experiment was based on about 500 randomly chosen entries, 
which were extracted from the repository. The automatic off-line
indexing was performed on the remaining 3,000 entries, and the resulting
keyword vectors were used to re-insert the 500 entries that were
extracted. The hit ratio for each case was computed by comparing
the algorithm’s classification of these entries with their original
manual classifications (Fig. 7). Manually checking those that were
misclassified revealed that in many cases they were indeed ambiguous,
and had very short annotations that only included very general
terms.

Figure 5: Distribution of topics scores (in a competitive set of 6 topics) for 30 sample queries. The topics are sorted in a descending
order for each one.

Figure 6: Results of the compression procedure for 500 queries
from one of the 7-fold cross validation experiments (described
in Section 4), sorted in descending order by the uncompressed output
set sizes.
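
The evaluation protocol can be outlined with a small harness. The `train` and `classify` callables are hypothetical placeholders for the off-line indexing and the on-line matching, and counting a hit whenever the suggested topics overlap the manual ones is an assumption about how hits are defined.

```python
import random


def k_fold_hit_ratios(entries, train, classify, k=7, seed=0):
    """Return the hit ratio of each of the k folds.

    `entries` is a list of (entry, manual_topics) pairs.
    `train(pairs)` builds a model from the remaining entries, and
    `classify(model, entry)` returns the set of suggested topics.
    """
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    ratios = []
    for i, held_out in enumerate(folds):
        training = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        model = train(training)
        # Assumption: a hit means the suggestions overlap the manual topics.
        hits = sum(1 for entry, topics in held_out
                   if classify(model, entry) & set(topics))
        ratios.append(hits / len(held_out) if held_out else 0.0)
    return ratios
```

With about 3,500 entries and k = 7, each fold holds out roughly 500 entries and trains on the remaining 3,000, as in the experiments described above.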

Our experimental results have corroborated those of McCallum 
et al. that larger vocabulary sizes generally perform better. For 
larger branches of the index our algorithm selects more keywords, 
and the classification reached its highest accuracy (near 100%). For 
example, the “Operating Systems and Run-Time Support” topic,
which is one of the biggest topics in the repository with a vocabulary
of over 7,700 distinct 5-grams, got a 100% hit ratio, whereas “Algorithms
and Applications”, which is a smaller topic containing about
3,700 keywords, attained only a 92% hit ratio. Further evidence is
the decrease in hit ratio at lower levels, due to the smaller
number of entries and therefore the smaller number of keywords,
as shown in Fig. 8.

Generally, the results indicate that the more information is available
about each concept and each query, the better the matching
that is achieved. However, we find that even a relatively short annotation
of 2-3 lines is enough for a reasonably good classification.

Figure 7: The hit ratios achieved at different levels of the index
hierarchy, and how they depend on different parts of the
classification algorithm.

Figure 8: Distribution of keywords under the 7 top-level topics
(which are sorted by size).

5. CONCLUSIONS 

We have developed and presented the details of a data classifica- 
tion algorithm for effective concept-based storage and retrieval of 
scientific papers in multi-level hierarchical repositories. The three 
main features of the algorithm are its homogeneity, scale indepen- 
dence, and self-updateability. The algorithm is homogeneous in 
that it produces good results at all levels of the hierarchical index,
and does not depend on the index depth. It is scale independent due
to the normalization of the keyword vectors, resulting in fair judg- 
ments for various-sized concept pages. It updates the keyword vec- 
tors regularly, thus keeping them current and adjusting to changes 
in the repository contents. This is done at selected intervals, rather 
than on-line for each new entry, because every local change in an 
individual concept page causes changes in the entire topic’s vocab- 
ulary, and so in the selection of keywords across the entire compet- 
itive set; moreover, this effect can propagate up the hierarchy. 

Results of experimentation with the BoW prototype repository 
on parallel systems are very promising. At the top level, nearly 95% 
of the entries were classified correctly, and this dropped to just un- 
der 90% for the lowest levels. Remarkably, this was achieved with 
only the entry details (mainly title and authors), and very short an- 
notations typically between one and three sentences long. There 
was no access to or use of full text. The entries that were mis- 
classified were found to be ambiguous and had short or missing 
annotations. 

In the future we hope to test our algorithm on additional reposi- 
tories. Possible extensions include automatic construction of a full 
thesaurus for all the words and phrases in the given corpus. A big- 
ger challenge is automatic index creation from scratch. Our sug- 
gestion is to use one of the hierarchical clustering methods [12] 
combined with the described automatic indexing algorithm. 